2 Introduction

In this project we implemented three different changepoint detection algorithms. Two of them detect two changepoints in a given dataset; the third works online and can detect an arbitrary number of changepoints.

In this report we compare these models and describe the statistical workflow behind each algorithm.

3 Description of the Problem

Given sequential data, we may be interested in changes in the data, especially when the data change in some structural way: a machinery system may drift toward failure, or an economic trend may reverse. If we assume that the data are the result of such a generative process, then we can look for changes in that generative process.

Changepoint detection models aim to detect changes in signals or time series data. A model may detect only a single changepoint, or multiple changepoints. In our project we implemented three different models.

  • Gamma-Poisson Multiple Changepoint Detection for two changepoints
  • Hierarchical Gamma-Poisson Multiple Changepoint Detection for two changepoints

The third model we implemented is an online changepoint model.

  • Bayesian Online Changepoint Detection[*]

The details of the models are given in the following sections.

[*] Adams, Ryan Prescott, and David JC MacKay. “Bayesian online changepoint detection.” arXiv preprint arXiv:0710.3742 (2007).

4 Description of the Data

4.1 Coal Mining Dataset

We use the Coal Mining Disasters dataset from the BaM library (BaM: Functions and Datasets for Books by Jeff Gill). The data is a vector of length 111 covering British coal mining disasters.

Each entry records the number of deaths from coal mine disasters during the corresponding time period.

Source: Lynn, R. and Vanhanen, T. (2001). National IQ and Economic Development. Mankind Quarterly LXI, 415-437.

5 Model Details and Stan Codes

5.1 Gamma-Poisson Multiple Changepoint Model

The assumptions of this model are:

  • There are two changepoints in the time series; that is, there are three intervals.
  • Within each interval, the data come from a Poisson distribution.
  • The means of the Poisson distributions are generated by underlying Gamma distributions.

The generative process of the model is:

\[ \begin{aligned} e & \sim \text { Gamma }\left(r_{e},1\right) \\ l & \sim \text { Gamma }\left(r_{l},1\right) \\ m & \sim \text { Gamma }\left(r_{m},1\right) \\ s_1 & \sim \text { Uniform }(1, T) \\ s_2 & \sim \text { Uniform }(1, T) \\ \end{aligned} \]

\[ D_{t} \sim \text { Poisson }( \lambda ) \quad \lambda = \left\{\begin{array}{lll}{e} & \text {if }\: t < s_1 \\ {l} & \text {if }\: s_1 \le t < s_2 \\ {m} & \text {otherwise } \end{array}\right. \]
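As a sanity check, the generative process above can be simulated directly. The following NumPy sketch is our own (the function name and the ordering of the two uniform changepoint draws are our choices, not part of the project code):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T, r_e, r_l, r_m):
    """Draw one dataset from the two-changepoint Gamma-Poisson model."""
    e = rng.gamma(shape=r_e, scale=1.0)   # early-interval mean
    l = rng.gamma(shape=r_l, scale=1.0)   # middle-interval mean
    m = rng.gamma(shape=r_m, scale=1.0)   # late-interval mean
    # two uniform changepoints, ordered so the middle interval is well defined
    s1, s2 = sorted(rng.integers(1, T + 1, size=2))
    t = np.arange(1, T + 1)
    rates = np.where(t < s1, e, np.where(t < s2, l, m))
    return rng.poisson(rates), (s1, s2)

D, (s1, s2) = simulate(T=111, r_e=3.0, r_l=1.0, r_m=0.5)
```

Simulated draws like these are useful for checking that the inference code recovers known changepoint locations.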

The likelihood of the model is:

\[ \begin{aligned} p(D | e, l, m) &= \sum_{s_1=1}^{T} \sum_{s_2=s_1}^{T} p(s_1, s_2, D | e, l, m) \\ &=\sum_{s_1=1}^{T} \sum_{s_2=s_1}^{T} p(s_1) p(s_2) p(D | s_1, s_2, e, l, m) \\ &=\sum_{s_1=1}^{T} \sum_{s_2=s_1}^{T} \text { Uniform }(s_1 | 1, T) \text { Uniform }(s_2 | 1, T) \prod_{t=1}^{T} \text { Poisson }\left(D_{t} | t<s_1 ?\; e : (t<s_2)\; ? \; l :\; m\right) \end{aligned} \]
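This marginalization can be evaluated directly. The sketch below (our own Python, not the project code) uses prefix sums of the Poisson log-pmfs so that each \((s_1, s_2)\) term costs \(O(1)\), mirroring the trick used in the Stan code; following the Stan indexing convention, \(e\) covers \(t \le s_1\), \(l\) covers \(s_1 < t \le s_2\), and \(m\) covers the rest:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def log_marginal(D, e, l, m):
    """log p(D | e, l, m), with both changepoints summed out."""
    T = len(D)
    # cum_x[k] = sum of the first k Poisson log-pmfs under rate x
    cum_e = np.concatenate(([0.0], np.cumsum(poisson.logpmf(D, e))))
    cum_l = np.concatenate(([0.0], np.cumsum(poisson.logpmf(D, l))))
    cum_m = np.concatenate(([0.0], np.cumsum(poisson.logpmf(D, m))))
    log_unif = -np.log(T)
    terms = [2 * log_unif
             + cum_e[s1]                    # t <= s1 under e
             + cum_l[s2] - cum_l[s1]        # s1 < t <= s2 under l
             + cum_m[T] - cum_m[s2]         # t > s2 under m
             for s1 in range(1, T + 1)
             for s2 in range(s1, T + 1)]
    return logsumexp(terms)
```

The double loop over ordered pairs makes the whole computation quadratic in \(T\), just as in the Stan transformed parameters block.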

We modified the single-changepoint Exponential-Poisson model[*,**] to handle two changepoints, extending the linear-time single-changepoint computation accordingly. The resulting computation is quadratic, \(O(n^2)\) (roughly \((n^2+3n)/2\) operations), instead of \(O(n^3)\).

[*] Fonnesbeck, Chris, Anand Patil, David Huard, and John Salvatier. 2013. PyMC User’s Guide.

[**] Stan’s User Guide, Change Point Models; https://mc-stan.org/docs/2_21/stan-users-guide/change-point-section.html

The Stan code of Gamma-Poisson Multiple Changepoint Model for two changepoints is as follows:

## data {
##   int<lower=1> T;
##   int<lower=0> D[T];
##   real<lower=0> mean_D;
## }
## transformed data {
##   real log_unif;
##   log_unif = -log(T);
## }
## parameters {
##   real<lower=0> e;
##   real<lower=0> l;
##   real<lower=0> m;
## }
## transformed parameters {
##   matrix[T, T] lp;
## 
##   {
##     row_vector[T + 1] lp_e;
##     row_vector[T + 1] lp_l;
##     row_vector[T + 1] lp_m;
##     lp = rep_matrix(2*log_unif, T, T);
##     lp_e[1] = 0;
##     lp_l[1] = 0;
##     lp_m[1] = 0;
##       
##     for (t in 1:T) {
##       lp_e[t + 1] = lp_e[t] + poisson_lpmf(D[t] | e);
##       lp_l[t + 1] = lp_l[t] + poisson_lpmf(D[t] | l);
##       lp_m[t + 1] = lp_m[t] + poisson_lpmf(D[t] | m);
##     }      
##     for (s1 in 1:T){
##       for (s2 in s1:T){
##         lp[s1,s2] = lp_e[s1+1] - lp_l[s1+1] + lp_l[s2+1] - lp_m[s2+1] + lp_m[T+1];
##         lp[s2,s1] = -1e6;
##       }
##     }
##   }
## }
## model {
##   e ~ gamma(mean_D,1);
##   l ~ gamma(mean_D,1);
##   m ~ gamma(mean_D,1);
##   target += log_sum_exp(lp);
## }
## generated quantities {
##   int<lower=1,upper=T> s1s;
##   int<lower=1,upper=T> s2s;
##   int tmp;
##   tmp = categorical_logit_rng(to_vector(lp));
##   s1s = (tmp % T) == 0 ? T : (tmp % T);
##   s2s = (tmp / T ) == T ? T : (tmp / T  + 1) ;
## 
##   
## }
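The generated quantities block draws one cell of the \(T \times T\) matrix lp and decodes the flat index back into the pair \((s_1, s_2)\). Since Stan's to_vector flattens a matrix column-major, a 1-based index \(k\) maps to row \(s_1\) and column \(s_2\) as in the following helper (our own Python sketch, not part of the project code); the cells masked with -1e6 carry essentially no probability mass, so the sampled pairs respect the ordering constraint:

```python
def decode(k, T):
    """Map a 1-based, column-major flat index over a T x T matrix
    back to 1-based (row, col) = (s1, s2), as Stan's to_vector lays it out."""
    return (k - 1) % T + 1, (k - 1) // T + 1
```

For example, with T = 111, index 112 decodes to row 1, column 2, i.e. \(s_1 = 1\), \(s_2 = 2\).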

5.2 Hierarchical Multiple Changepoint Model

In the hierarchical multiple changepoint model, we assume that the shape parameters of the priors for the interval means have a common prior.

\[ \begin{aligned} r_{\{e, l, m\}} & \sim \text { Gamma } (\alpha ,1)\\ e & \sim \text { Gamma }\left(r_{e},1\right) \\ l & \sim \text { Gamma }\left(r_{l},1\right) \\ m & \sim \text { Gamma }\left(r_{m},1\right) \\ s_1 & \sim \text { Uniform }(1, T) \\ s_2 & \sim \text { Uniform }(1, T) \\ \end{aligned} \]

\[ D_{t} \sim \text { Poisson }( \lambda ) \quad \lambda = \left\{\begin{array}{lll}{e} & \text {if }\: t < s_1 \\ {l} & \text {if }\: s_1 \le t < s_2 \\ {m} & \text {otherwise } \end{array}\right. \]

The Stan code of Hierarchical Gamma-Poisson Multiple Changepoint Model for two changepoints is as follows:

## data {
##   int<lower=1> T;
##   int<lower=0> D[T];
##   real<lower=0> mean_D;
## }
## transformed data {
##   real log_unif;
##   log_unif = -log(T);
## }
## parameters {
##   real<lower=0> e;
##   real<lower=0> l;
##   real<lower=0> m;
##   real<lower=0> r[3];
## }
## transformed parameters {
##   matrix[T, T] lp;
##   {
##     row_vector[T + 1] lp_e;
##     row_vector[T + 1] lp_l;
##     row_vector[T + 1] lp_m;
##     lp = rep_matrix(2*log_unif, T, T);
##     lp_e[1] = 0;
##     lp_l[1] = 0;
##     lp_m[1] = 0;
##       
##     for (t in 1:T) {
##       lp_e[t + 1] = lp_e[t] + poisson_lpmf(D[t] | e);
##       lp_l[t + 1] = lp_l[t] + poisson_lpmf(D[t] | l);
##       lp_m[t + 1] = lp_m[t] + poisson_lpmf(D[t] | m);
##     }      
##     for (s1 in 1:T){
##       for (s2 in s1:T){
##         lp[s1,s2] = lp_e[s1+1] - lp_l[s1+1] + lp_l[s2+1] - lp_m[s2+1] + lp_m[T+1];
##         lp[s2,s1] = -1e6;
##       }
##     }
##   }
## }
## model {
##   r ~ gamma(mean_D,1);
##   e ~ gamma(r[1],1);
##   l ~ gamma(r[2],1);
##   m ~ gamma(r[3],1);
##   target += log_sum_exp(lp);
## }
## generated quantities {
##   int<lower=1,upper=T> s1s;
##   int<lower=1,upper=T> s2s;
##   int tmp;
##   tmp = categorical_logit_rng(to_vector(lp));
##   s1s = (tmp % T) == 0 ? T : (tmp % T);
##   s2s = (tmp / T ) == T ? T : (tmp / T  + 1) ;
## }

5.3 Bayesian Online Changepoint Detection

In this section we discuss the online changepoint algorithm and its Stan implementation, following Ryan P. Adams and David J.C. MacKay’s technical report on Bayesian online changepoint detection.

In contrast to the previous models, we do not fix the number of changepoints here. Instead, the sequence of observations is divided into non-overlapping subsequences, and we are interested in the transitions between these subsequences.

Bayesian online changepoint detection focuses on the run length of the current subsequence, i.e. the time since the last changepoint. The run length is denoted \(r_t\) and is updated as follows:

\[\begin{eqnarray} P\left(r_{t} | r_{t-1}\right)=\left\{\begin{array}{ll}{H\left(r_{t-1}+1\right)} & {\text { if } r_{t}=0} \\ {1-H\left(r_{t-1}+1\right)} & {\text { if } r_{t}=r_{t-1}+1} \\ {0} & {\text { otherwise }}\end{array}\right. \end{eqnarray}\]

Here \(H(\tau)\) is a hazard function that depends on the current run length. For simplicity it can also be treated as a constant, in which case we write \(H(\tau) = 1/\lambda\).
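Under a constant hazard, the transition kernel above says the run length either resets to zero with probability \(1/\lambda\) or grows by one, so segment lengths are a priori geometric with mean \(\lambda\). A minimal simulation sketch (ours, with an assumed \(\lambda = 100\)):

```python
import numpy as np

rng = np.random.default_rng(0)

def constant_hazard(r, lam=100.0):
    """H(tau) = 1/lambda, independent of the current run length."""
    return 1.0 / lam

def step(r_prev, hazard=constant_hazard):
    """One draw from P(r_t | r_{t-1}): reset to 0 with prob H, else grow by 1."""
    return 0 if rng.random() < hazard(r_prev + 1) else r_prev + 1
```

Iterating `step` from \(r_0 = 0\) yields sawtooth run-length trajectories whose segments average about \(\lambda\) steps.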

5.3.1 Recursive Run Length Estimation

If we can calculate the predictive distribution of the most recent observation given the current run length, then we can calculate the marginal predictive distribution by summing over the run-length posterior:

\[\begin{eqnarray} P\left(x_{t+1} \mid \boldsymbol{x}_{1: t}\right)=\sum_{r_{t}} P\left(x_{t+1} \mid r_{t}, \boldsymbol{x}_{t}^{(r)}\right) P\left(r_{t} \mid \boldsymbol{x}_{1: t}\right) \end{eqnarray}\]

where the posterior distribution of the run length at time \(t\) is

\[\begin{eqnarray} P\left(r_{t} \mid \boldsymbol{x}_{1: t}\right)=\frac{P\left(r_{t}, \boldsymbol{x}_{1: t}\right)}{P\left(\boldsymbol{x}_{1: t}\right)} \end{eqnarray}\]

The joint distribution can be calculated by a recursive algorithm similar to the forward algorithm for HMMs:

\[\begin{eqnarray} P\left(r_{t}, \boldsymbol{x}_{1: t}\right) &=& \sum_{r_{t-1}} P\left(r_{t}, r_{t-1}, \boldsymbol{x}_{1: t}\right) \\ &=&\sum_{r_{t-1}} P\left(r_{t}, x_{t} \mid r_{t-1}, \boldsymbol{x}_{1: t-1}\right) P\left(r_{t-1}, \boldsymbol{x}_{1: t-1}\right) \\ &=& \sum_{r_{t-1}} P\left(r_{t} \mid r_{t-1}\right) P\left(x_{t} \mid r_{t-1}, \boldsymbol{x}_{t}^{(r)}\right) P\left(r_{t-1}, \boldsymbol{x}_{1: t-1}\right) \end{eqnarray}\]

Ultimately, we want to infer both the run-length posterior distribution \(p(r_t \mid \boldsymbol{x}_{1:t})\) and the posterior predictive distribution \(p(x_{t+1} \mid \boldsymbol{x}_{1:t})\) so that we can predict the next data point given all the data we have seen so far.
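Putting the pieces together, the recursion can be sketched outside Stan as well. The following NumPy implementation is our own sketch of the standard Adams-MacKay filter for Gamma-Poisson data (not the project code); it exploits the fact that the Gamma-Poisson predictive is negative binomial, so the per-run-length posterior update stays in closed form:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import nbinom

def bocpd_gamma_poisson(D, H=0.01, a0=1.0, b0=1.0):
    """Run-length filter for Poisson data with a Gamma(a0, b0) rate prior.

    At each step every run either grows by one (prob 1 - H) or resets to
    zero (prob H); columns of the returned matrix are log run-length
    posteriors P(r_t | x_{1:t}).
    """
    T = len(D)
    log_R = np.full((T + 1, T + 1), -np.inf)
    log_R[0, 0] = 0.0                        # r_0 = 0 with certainty
    a = np.array([a0])                       # posterior shape per run length
    b = np.array([b0])                       # posterior rate per run length
    for t in range(T):
        # Gamma-Poisson predictive of D[t] is NegBin(n=a, p=b/(b+1))
        pred = nbinom.logpmf(D[t], a, b / (b + 1.0))
        grow = log_R[:t + 1, t] + pred + np.log(1 - H)        # run continues
        cp = logsumexp(log_R[:t + 1, t] + pred + np.log(H))   # changepoint
        log_R[1:t + 2, t + 1] = grow
        log_R[0, t + 1] = cp
        log_R[:, t + 1] -= logsumexp(log_R[:, t + 1])         # normalize
        a = np.concatenate(([a0], a + D[t]))                  # conjugate update
        b = np.concatenate(([b0], b + 1.0))
    return log_R
```

A spike of posterior mass at \(r_t = 0\) flags a likely changepoint at time \(t\).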

We implemented two versions of this model. In the first, we implemented the analytical solution of the Bayesian online changepoint model (BOCM). In the second, we implemented a Monte Carlo solution. In both, the observations again come from a Poisson distribution, whose mean is drawn from an underlying Gamma distribution.

The Stan code of the analytical solution of BOCM is as follows:

## data {
##   int<lower=1> T;
##   int<lower=0> D[T];
##   real<lower=0, upper=1> H;
## }
## transformed data {
##   real log_increase;
##   real log_decrease;
##   matrix[T+1, T+1] R;
##   matrix[T+1, T+1] alpha;
##   matrix[T+1, T+1] beta;
## 
##   log_increase = log(1 - H);
##   log_decrease = log(H);
##   R = rep_matrix(-1e20, T+1, T+1);
##   R[1,1] = 0;
##   alpha = rep_matrix(1, T+1, T+1);
##   beta = rep_matrix(1, T+1, T+1);
##   for (t in 1:T){
##     for (j in 1:t){
##       R[j+1, t+1] = R[j, t] + neg_binomial_lpmf(D[t] | alpha[j,t], beta[j,t]) + log_increase;
##       R[1, t+1] = log_sum_exp(R[1, t+1], R[j, t] + neg_binomial_lpmf(D[t] | alpha[1,t], beta[1,t]) + log_decrease);
##       alpha[j+1, t+1] = alpha[j, t] + D[t];
##       beta[j+1, t+1] = beta[j, t] + 1;
##     }
##   }
## }
## model {
## 
## }
## generated quantities {
##   vector[T] meas;
##   for (t in 1:T)
##     meas[t] = categorical_logit_rng(col(R, t+1)[1:t+1]);
## }

The Stan code of the Monte Carlo solution of BOCM is as follows:

## data {
##   int<lower=1> T;
##   int<lower=0> D[T];
##   real<lower=0> mean_D;
##   real<lower=0, upper=1> H;
## }
## transformed data {
##   real log_increase;
##   real log_decrease;
## 
##   log_increase = log(1 - H);
##   log_decrease = log(H);
## }
## parameters {
##   vector<lower=0>[T] alpha;
## }
## transformed parameters {
##   matrix[T+1, T+1] R;
##   R = rep_matrix(-1e20, T+1, T+1);
##   R[1,1] = 0;
##   for (t in 1:T){
##     for (j in 1:t){
##       R[j+1, t+1] = R[j, t] + poisson_lpmf(D[t] | alpha[t-j+1]) + log_increase;
##       R[1, t+1] = log_sum_exp(R[1, t+1], R[j, t] + poisson_lpmf(D[t] | alpha[t]) + log_decrease);
##     }
##   }
## }
## model {
##   alpha ~ gamma(mean_D,1);
##   target += log_sum_exp(R);
## }
## generated quantities {
##   vector[T] meas;
##   for (t in 1:T)
##     meas[t] = categorical_logit_rng(col(R, t+1)[1:t+1]);
## }

6 Prior Choices and Fitting the Models

6.1 Gamma-Poisson Multiple Changepoint Model

6.1.1 Prior Choices

We place Gamma priors on the interval means and a uniform prior on the changepoints. We use the mean of the data as the shape of the Gamma priors on the interval means:

\[ r_e = r_l = r_m = \text{mean}(D), \qquad e, l, m \sim \text {Gamma} (\text{mean}(D),1) \] Since we do not want to put any constraints on the changepoints, we use a uniform distribution over the data length.
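As a quick prior sanity check, a Gamma prior with shape \(\text{mean}(D)\) and scale 1 has mean \(\text{mean}(D)\), so each interval rate is weakly centred on the overall level of the data. A small sketch of ours (the value 1.7 is only an illustrative stand-in for mean(D)):

```python
import numpy as np

rng = np.random.default_rng(1)
D_mean = 1.7  # illustrative stand-in for mean(D) on the coal data
draws = rng.gamma(shape=D_mean, scale=1.0, size=200_000)
# A Gamma(k, 1) prior has mean k, so the draws average close to D_mean
print(draws.mean())
```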

6.2 Hierarchical Multiple Changepoint Model

6.2.1 Prior Choices

We use the mean of the data as the shape of the hyperprior on the interval-mean priors; that is, we set \(\alpha = \text{mean}(D)\). We use a scale of 1 for all Gamma distributions. Using the mean of the data as the Gamma shape of the priors thus seems plausible for both models.

Since we do not want to put any constraints on the changepoints, we use a uniform distribution over the data length.

6.3 Bayesian Online Changepoint Detection

6.3.1 Prior Choices

As in the previous models, we use the mean of the data as the shape of the Gamma prior on the interval means. Additionally we have a hazard prior, which corresponds to the rate of changepoints. In our experiments we use \(H=0.01\): each point is a changepoint with probability \(H\), so segment lengths are a priori geometric with mean \(1/H = 100\).

6.3.2 Fitting the Model

For the analytical solution we use fixed parameters, so no sampling over model parameters is needed.

For the Monte Carlo version we use MCMC, fitting the model with 5 chains of 2000 iterations each, including a 200-iteration warm-up phase.

7 Histograms and Divergences of Parameters

7.1 Gamma-Poisson Multiple Changepoint Model

7.1.2 Convergence Diagnostics

The summary of model fit is as follows:

## Inference for Stan model: multiple_changepoint_gamma.
## 5 chains, each with iter=5000; warmup=2000; thin=1; 
## post-warmup draws per chain=3000, total post-warmup draws=15000.
## 
##      mean se_mean    sd  2.5%   50%  97.5% n_eff Rhat
## e    3.11    0.00  0.30  2.57  3.10   3.73  6907 1.00
## l    1.20    0.01  0.42  0.79  1.11   2.69  1049 1.01
## m    0.52    0.01  0.31  0.15  0.44   1.12  2899 1.00
## s1s 38.40    0.12  4.94 29.00 39.00  45.00  1711 1.00
## s2s 89.26    0.51 18.16 40.00 96.00 106.00  1264 1.00
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 13 02:16:20 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The Rhat values for all parameters and changepoint estimates are 1, indicating that the chains have converged.

The effective sample sizes also look good: the ratio n_eff / n_transitions is greater than 0.001 for all parameters.

Here we check the tree depth, E-BFMI, and divergences; all look good.

## 0 of 15000 iterations saturated the maximum tree depth of 10.
## E-BFMI indicated no pathological behavior.
## 0 of 15000 iterations ended with a divergence.

As the following trace plots of the samples show, the chains for the mean of the first interval have clearly converged. The chains for the means of the second and third intervals have also converged, although they show some multimodality.

The trace plots for the first and second changepoints also show some multimodality, and it is not random: it is weaker for the first changepoint than for the second. The histograms of the changepoints show a slight probability that the first changepoint lies closer to the beginning of the data, and some probability that the second changepoint is around 40. This multimodal posterior over the changepoints is what produces the jumps between modes in the chains.

7.2 Hierarchical Multiple Changepoint Model

7.2.2 Convergence Diagnostics

The summary of the model fit is as follows:

## Inference for Stan model: hierar_multiple_changepoint_gamma.
## 5 chains, each with iter=6000; warmup=2000; thin=1; 
## post-warmup draws per chain=4000, total post-warmup draws=20000.
## 
##      mean se_mean    sd  2.5%   50%  97.5% n_eff Rhat
## e    3.13    0.00  0.31  2.58  3.12   3.74 11098    1
## l    1.21    0.01  0.45  0.80  1.11   2.90  1617    1
## m    0.45    0.01  0.32  0.11  0.37   1.08  4124    1
## s1s 38.19    0.12  5.36 23.00 39.00  45.00  2164    1
## s2s 90.89    0.38 16.89 40.00 96.00 104.00  1933    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Dec 13 02:21:03 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

The Rhat values for all parameters and changepoint estimates are 1, indicating that the chains have converged.

The effective sample sizes also look good: the ratio n_eff / n_transitions is greater than 0.001 for all parameters.

Here we check the tree depth, E-BFMI, and divergences; all look good.

## 0 of 20000 iterations saturated the maximum tree depth of 10.
## E-BFMI indicated no pathological behavior.
## 0 of 20000 iterations ended with a divergence.

The same pattern as in the non-hierarchical model appears here. The best-converged parameter is the mean of the first interval, and the multimodality is strongest for the second changepoint because of its two plausible locations.

7.3 Bayesian Online Changepoint Detection

7.3.2 Convergence Diagnostics

## [1] 0

For the analytical solution there is no need for convergence diagnostics, because the model has no random parameters.

## [1] 7199.828
## [1] 0

The minimum Rhat value across all parameters and changepoint estimates is 1, indicating that the chains have converged.

The effective sample sizes also look good: the ratio n_eff / n_transitions is greater than 0.001 for all parameters.

Here we check the tree depth, E-BFMI, and divergences; all look good.

## 0 of 10000 iterations saturated the maximum tree depth of 10.
## E-BFMI indicated no pathological behavior.
## 0 of 10000 iterations ended with a divergence.

8 Bayesian Online Changepoint Detection (Synthetic Data)

In this section we extend our work to show how effectively the Bayesian online changepoint detection algorithm finds more than one changepoint.

9 Conclusion

The multiple changepoint model (MCM) and the hierarchical multiple changepoint model (HMCM) perform similarly. However, when we compare the standard deviations of the parameters, HMCM has slightly lower standard deviations than MCM. This can also be seen in the histograms of the estimate of the second changepoint, \(s_2\).

For the online changepoint model, we compared our MCMC draws with the analytical solution of the problem.

Overall, the Bayesian online changepoint model is superior to the other models: it can find any number of changepoints in linear time and it converges much faster. In this project we implemented the Poisson-observation, Gamma-prior version of the algorithm.

We specifically used a conjugate prior model in order to compare against the analytical solution. As future work, more generic versions of the MCMC solution could be implemented, since MCMC also works for models that do not have conjugate priors, which allows more complex models.